https://besjournals.onlinelibrary.wiley.com/doi/10.1111/j.1365-2656.2008.01390.x
Load the necessary libraries
library(gbm)          # gradient boosted models
library(car)          # scatterplot matrices
library(dismo)        # gbm.step and other boosted regression tree helpers
library(pdp)          # partial dependence plots
library(ggfortify)    # autoplot methods
library(randomForest) # random forests
library(tidyverse)    # data manipulation and ggplot2
library(gridExtra)    # arranging multiple grid-based plots
library(patchwork)    # composing ggplots
Abalone are an important aquaculture shellfish, farmed for both their meat and their shells. Abalone can live for up to 50 years, although their longevity is known to be influenced by a range of environmental factors. Traditionally, abalone are aged by counting their growth rings; however, this method is laborious and expensive. Hence a study was conducted in which abalone growth-ring counts were matched against a range of more easily measured physical characteristics (such as shell dimensions and weights) to see whether any of these other parameters could be used as proxies for the number of growth rings (and hence age).
abalone
Format of abalone.csv data file
abalone = read_csv('../public/data/abalone.csv', trim_ws=TRUE)
glimpse(abalone)
## Rows: 4,177
## Columns: 10
## $ SEX <chr> "M", "M", "F", "M", "I", "I", "F", "F", "M", "F", "F", "M…
## $ LENGTH <dbl> 0.455, 0.350, 0.530, 0.440, 0.330, 0.425, 0.530, 0.545, 0…
## $ DIAMETER <dbl> 0.365, 0.265, 0.420, 0.365, 0.255, 0.300, 0.415, 0.425, 0…
## $ HEIGHT <dbl> 0.095, 0.090, 0.135, 0.125, 0.080, 0.095, 0.150, 0.125, 0…
## $ WHOLE_WEIGHT <dbl> 0.5140, 0.2255, 0.6770, 0.5160, 0.2050, 0.3515, 0.7775, 0…
## $ MEAT_WEIGHT <dbl> 0.2245, 0.0995, 0.2565, 0.2155, 0.0895, 0.1410, 0.2370, 0…
## $ GUT_WEIGHT <dbl> 0.1010, 0.0485, 0.1415, 0.1140, 0.0395, 0.0775, 0.1415, 0…
## $ SHELL_WEIGHT <dbl> 0.150, 0.070, 0.210, 0.155, 0.055, 0.120, 0.330, 0.260, 0…
## $ RINGS <dbl> 15, 7, 9, 10, 7, 8, 20, 16, 9, 19, 14, 10, 11, 10, 10, 12…
## $ AGE <dbl> 16.5, 8.5, 10.5, 11.5, 8.5, 9.5, 21.5, 17.5, 10.5, 20.5, …
abalone = abalone %>% mutate(SEX=factor(SEX))
ggplot(abalone) +
geom_point(aes(y=AGE, x=RINGS))
## Very tight relationship
scatterplotMatrix(~RINGS + SEX +LENGTH+DIAMETER+HEIGHT+WHOLE_WEIGHT+MEAT_WEIGHT+GUT_WEIGHT+
SHELL_WEIGHT, data=abalone)
We will start by partitioning the data into training and test sets. The purpose of partitioning is to provide one larger set of data that can be used to train the model and a smaller one that can be used to evaluate its predictive accuracy on data it has not seen.
set.seed(123)
nrow(abalone)
## [1] 4177
i = sample(1:nrow(abalone), 100,replace=FALSE)
abalone.train = abalone[-i,]
abalone.test = abalone[i,]
Now we will specify the gradient boosted model. The values of n.trees, interaction.depth and shrinkage used below are simply typical good starting points; nevertheless, the best values are data-set dependent. We will start with these values and then evaluate whether they are appropriate.
abalone.gbm = gbm(RINGS ~ SEX + LENGTH + DIAMETER + HEIGHT +
WHOLE_WEIGHT + MEAT_WEIGHT + GUT_WEIGHT +
SHELL_WEIGHT,
data=abalone, ## NB: fitted to the full data; for an honest evaluation against abalone.test, use data=abalone.train
distribution='poisson',
var.monotone=c(0,1,1,1,1,1,1,1),
n.trees=10000,
interaction.depth=5,
bag.fraction=0.5,
shrinkage=0.01,
train.fraction=1,
cv.folds=3)
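Rather than relying on a single guess, candidate shrinkage values can be compared with a small grid search. This is a sketch; the grid values are illustrative, and refitting the model for each value is slow:

```r
## Sketch: compare cross-validated deviance across candidate shrinkage values.
## The grid below is illustrative, not a recommendation.
shrink.grid <- c(0.01, 0.005, 0.001)
cv.err <- sapply(shrink.grid, function(s) {
    fit <- gbm(RINGS ~ SEX + LENGTH + DIAMETER + HEIGHT +
                   WHOLE_WEIGHT + MEAT_WEIGHT + GUT_WEIGHT + SHELL_WEIGHT,
               data = abalone, distribution = 'poisson',
               n.trees = 5000, interaction.depth = 5,
               shrinkage = s, bag.fraction = 0.5, cv.folds = 3)
    min(fit$cv.error)  # best cross-validated deviance for this shrinkage
})
data.frame(shrinkage = shrink.grid, cv.error = cv.err)
```

The shrinkage that achieves the lowest cross-validated deviance within the tree budget is the one to carry forward.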
We will now determine the optimum number of trees, estimated so as to achieve a balance between bias (being biased towards the exact training observations) and precision (variability in estimates). Ideally, the optimum number of trees should be in the vicinity of 1000. If it is much less (as in this case), it could imply that the model learned too quickly. On the other hand, if the optimum number of trees is very close to the total number of fitted trees, then it suggests that the optimum may not actually have occurred yet and that more trees should be used (or a faster learning rate).
(best.iter = gbm.perf(abalone.gbm,method='OOB'))
## [1] 265
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4,
## length(x)/10), 50))
##
## Number of Observations: 10000
## Equivalent Number of Parameters: 39.99
## Residual Standard Error: 8.399e-05
(best.iter = gbm.perf(abalone.gbm,method='cv'))
## [1] 377
Conclusions:
abalone.gbm = gbm(RINGS ~ SEX + LENGTH + DIAMETER + HEIGHT +
WHOLE_WEIGHT + MEAT_WEIGHT + GUT_WEIGHT +
SHELL_WEIGHT,
data=abalone,
distribution='poisson',
var.monotone=c(0,1,1,1,1,1,1,1),
n.trees=5000,
interaction.depth=5,
bag.fraction=0.5,
shrinkage=0.001,
train.fraction=1,
cv.folds=3)
(best.iter = gbm.perf(abalone.gbm,method='OOB'))
## [1] 3497
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4,
## length(x)/10), 50))
##
## Number of Observations: 5000
## Equivalent Number of Parameters: 39.99
## Residual Standard Error: 3.931e-06
(best.iter = gbm.perf(abalone.gbm,method='cv'))
## [1] 3799
Conclusions:
If a predictor is an important driver of the patterns in the response, then many of the tree splits should feature that predictor. It thus follows that the proportion of total splits featuring each predictor provides a measure of the relative influence of each of the predictors.
summary(abalone.gbm, n.trees=best.iter)
Conclusions:
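The relative influences can also be returned as a data frame (suppressing the barplot), which is convenient for tabulating or plotting manually:

```r
## plotit=FALSE suppresses the barplot and returns a data frame
## with columns var and rel.inf (percentages summing to 100)
rel.inf <- summary(abalone.gbm, n.trees = best.iter, plotit = FALSE)
rel.inf[order(-rel.inf$rel.inf), ]
```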
attr(abalone.gbm$Terms,"term.labels")
## [1] "SEX" "LENGTH" "DIAMETER" "HEIGHT" "WHOLE_WEIGHT"
## [6] "MEAT_WEIGHT" "GUT_WEIGHT" "SHELL_WEIGHT"
plot(abalone.gbm, 8, n.tree=best.iter)
abalone.gbm %>%
pdp::partial(pred.var='SHELL_WEIGHT',
n.trees=best.iter,
recursive=FALSE,
inv.link=exp) %>%
autoplot()
recursive=TRUE indicates that the fast weighted tree-traversal method described by Friedman (2001) should be used (this only works for gbm objects). Otherwise a slower brute-force method is used. If we want to back-transform the predictions onto the response scale (via inv.link), the brute-force method is required.
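To see the difference, the two methods can be run side by side. Note that the fast recursive traversal leaves yhat on the log link scale, while the brute-force method with inv.link=exp returns the response (ring) scale:

```r
## Fast weighted traversal: predictions remain on the link (log) scale
p.link <- pdp::partial(abalone.gbm, pred.var = 'SHELL_WEIGHT',
                       n.trees = best.iter, recursive = TRUE)
## Brute force: slower, but inv.link back-transforms to the response scale
p.resp <- pdp::partial(abalone.gbm, pred.var = 'SHELL_WEIGHT',
                       n.trees = best.iter, recursive = FALSE, inv.link = exp)
head(p.link)
head(p.resp)
```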
nms <- attr(abalone.gbm$Terms,"term.labels")
p <- vector('list', length(nms))
names(p) <- nms
for (nm in nms) {
print(nm)
p[[nm]] <- abalone.gbm %>% pdp::partial(pred.var=nm,
n.trees=best.iter,
inv.link=exp,
recursive=FALSE,
type='regression') %>%
autoplot() + ylim(0, 20)
}
## [1] "SEX"
## [1] "LENGTH"
## [1] "DIAMETER"
## [1] "HEIGHT"
## [1] "WHOLE_WEIGHT"
## [1] "MEAT_WEIGHT"
## [1] "GUT_WEIGHT"
## [1] "SHELL_WEIGHT"
patchwork::wrap_plots(p)
do.call('grid.arrange', p)
We might also want to explore interactions…
abalone.gbm %>%
pdp::partial(pred.var=c('SHELL_WEIGHT'),
n.trees=best.iter, recursive=FALSE, inv.link=exp) %>%
autoplot()
abalone.gbm %>%
pdp::partial(pred.var=c('SHELL_WEIGHT','HEIGHT'),
n.trees=best.iter, recursive=TRUE) %>%
autoplot()
g1 = abalone.gbm %>% pdp::partial(pred.var='SHELL_WEIGHT', n.trees=best.iter,
recursive=FALSE,inv.link=exp) %>%
autoplot
g2 = abalone.gbm %>% pdp::partial(pred.var='HEIGHT', n.trees=best.iter,
recursive=FALSE,inv.link=exp) %>%
autoplot
g1 + g2
abalone.acc <- abalone.test %>%
bind_cols(Pred = predict(abalone.gbm,
newdata=abalone.test,
n.tree=best.iter,
type='response'))
with(abalone.acc, cor(RINGS, Pred))
## [1] 0.6926778
abalone.acc %>%
ggplot() +
geom_point(aes(y=Pred, x=RINGS))
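Correlation alone can flatter a model with systematic bias, so RMSE and MAE on the same held-out set are useful complements (a quick sketch):

```r
## Additional accuracy summaries on the held-out test set
abalone.acc %>%
    summarise(RMSE = sqrt(mean((Pred - RINGS)^2)),
              MAE  = mean(abs(Pred - RINGS)))
```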
Computes Friedman’s H-statistic to assess the strength of variable interactions. This measures the relative strength of interactions in the model on a scale of 0-1, where 1 indicates a very strong interaction. For a purely additive model, y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + …, H = 0. If both main effects are weak, then the H-statistic will be unstable and could falsely indicate a strong interaction.
What were the strong main effects?

- SHELL_WEIGHT
- HEIGHT
- SEX
attr(abalone.gbm$Terms,"term.labels")
## [1] "SEX" "LENGTH" "DIAMETER" "HEIGHT" "WHOLE_WEIGHT"
## [6] "MEAT_WEIGHT" "GUT_WEIGHT" "SHELL_WEIGHT"
interact.gbm(abalone.gbm, abalone,c(1,4), n.tree=best.iter)
## [1] 0.4963162
interact.gbm(abalone.gbm, abalone,c(4,8), n.tree=best.iter)
## [1] 0.073613
interact.gbm(abalone.gbm, abalone,c(1,4,8), n.tree=best.iter)
## [1] 0.0220495
abalone.gbm %>% pdp::partial(pred.var=c(1), n.trees=best.iter, recursive=FALSE) %>% autoplot
abalone.gbm %>% pdp::partial(pred.var=c(1, 4), n.trees=best.iter, recursive=FALSE) %>% autoplot
abalone.gbm %>% pdp::partial(pred.var=c(1, 8), n.trees=best.iter, recursive=FALSE) %>% autoplot
#plot(abalone.gbm, c(1,4), n.tree=best.iter)
#plot(abalone.gbm, c(5,6), n.tree=best.iter)
abalone.grid = plot(abalone.gbm, c(1,4), n.tree=best.iter, return.grid=TRUE)
head(abalone.grid)
ggplot(abalone.grid, aes(y=HEIGHT, x=SEX)) +
geom_tile(aes(fill=y)) +
geom_contour(aes(z=y)) +
scale_fill_gradientn(colors=heat.colors(10))
This takes a long time - consider running it over a break.
abalone.gbm1 <- gbm.step(data=abalone %>% as.data.frame, gbm.x=1:8, gbm.y=9,
tree.complexity=5,
learning.rate=0.001,
bag.fraction=0.5,
n.trees=10000,
family='poisson')
##
##
## GBM STEP - version 2.9
##
## Performing cross-validation optimisation of a boosted regression tree model
## for RINGS and using a family of poisson
## Using 4177 observations and 8 predictors
## creating 10 initial models of 10000 trees
##
## folds are unstratified
## total mean deviance = 0.991
## tolerance is fixed at 0.001
## ntrees resid. dev.
## 10000 0.3983
## now adding trees...
##
## mean total deviance = 0.991
## mean residual deviance = 0.337
##
## estimated cv deviance = 0.398 ; se = 0.017
##
## training data correlation = 0.796
## cv correlation = 0.745 ; se = 0.012
##
## elapsed time - 0.06 minutes
##
## ########### warning ##########
##
## maximum tree limit reached - results may not be optimal
## - refit with faster learning rate or increase maximum number of trees
summary(abalone.gbm1)
gbm.plot(abalone.gbm1, n.plots=8, write.title = FALSE)
gbm.plot.fits(abalone.gbm1)
find.int <- gbm.interactions(abalone.gbm1)
summary(find.int)
## Length Class Mode
## rank.list 5 data.frame list
## interactions 64 -none- numeric
## gbm.call 21 -none- list
find.int$rank.list
gbm.perspec(abalone.gbm1,6,5)
nBoot <- 10
abalone.pred <- with(abalone,
expand.grid(SHELL_WEIGHT = modelr::seq_range(SHELL_WEIGHT, n = 100),
SEX = levels(SEX),
LENGTH = NA,
DIAMETER = NA,
HEIGHT = NA,
WHOLE_WEIGHT = NA,
MEAT_WEIGHT = NA,
GUT_WEIGHT = NA)
)
abalone.list <- vector('list', nBoot)
abalone.list
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL
##
## [[6]]
## NULL
##
## [[7]]
## NULL
##
## [[8]]
## NULL
##
## [[9]]
## NULL
##
## [[10]]
## NULL
abalone.sum <- vector('list', nBoot)
for (i in 1:nBoot) {
print(paste0('Boot number: ', i))
## Create random set
abalone.rnd <- abalone %>%
sample_n(size = n(), replace=TRUE)
## Fit the trees
abalone.gbm = gbm(RINGS ~ SEX + LENGTH + DIAMETER + HEIGHT +
WHOLE_WEIGHT + MEAT_WEIGHT + GUT_WEIGHT +
SHELL_WEIGHT,
data=abalone.rnd,
distribution='poisson',
var.monotone=c(0,1,1,1,1,1,1,1),
n.trees=5000,
interaction.depth=5,
bag.fraction=0.5,
shrinkage=0.001,
train.fraction=1,
cv.folds=3)
## Determine the best number of trees
(best.iter = gbm.perf(abalone.gbm,method='cv'))
## predict based on shell weight
fit <- predict(abalone.gbm, newdata = abalone.pred, n.trees = best.iter) %>% exp()
abalone.list[[i]] <- data.frame(abalone.pred, Boot = i, Fit = fit)
## relative influence
abalone.sum[[i]] <- summary(abalone.gbm, n.trees = best.iter)
}
## [1] "Boot number: 1"
## [1] "Boot number: 2"
## [1] "Boot number: 3"
## [1] "Boot number: 4"
## [1] "Boot number: 5"
## [1] "Boot number: 6"
## [1] "Boot number: 7"
## [1] "Boot number: 8"
## [1] "Boot number: 9"
## [1] "Boot number: 10"
abalone.fit <- do.call('rbind', abalone.list)
abalone.fit <- abalone.fit %>%
group_by(SHELL_WEIGHT, SEX) %>%
## summarise(Median = median(Fit),
## Lower = quantile(Fit, p=0.025),
## Upper = quantile(Fit, p=0.975))
ggdist::median_hdci(Fit)
g1 <- abalone.fit %>% ggplot(aes(y=Fit, x=SHELL_WEIGHT, fill=SEX, color=SEX)) +
geom_ribbon(aes(ymin=.lower, ymax=.upper), alpha=0.3, color=NA) +
geom_line() +
scale_fill_viridis_d() +
scale_colour_viridis_d() +
theme_classic()
abalone.inf <- do.call('rbind', abalone.sum)
abalone.inf <- abalone.inf %>%
group_by(var) %>%
ggdist::median_hdci(rel.inf)
g2 <- abalone.inf %>% ggplot(aes(y=var, x=rel.inf)) +
geom_vline(xintercept=12.5, linetype='dashed') +
geom_pointrange(aes(xmin=.lower, xmax=.upper)) +
theme_classic()
g2 + patchwork::inset_element(g1, left=0.5, bottom=0.01, right=1, top=0.7)
g1 + patchwork::inset_element(g2, left=0.5, bottom=0.01, right=1, top=0.5)
library(randomForest)
abalone.rf = randomForest(RINGS ~ SEX + LENGTH + DIAMETER + HEIGHT +
WHOLE_WEIGHT + MEAT_WEIGHT + GUT_WEIGHT + SHELL_WEIGHT,
data=abalone, importance=TRUE,
ntree=1000)
abalone.imp = randomForest::importance(abalone.rf)
## Rank by either:
## *MSE (mean decrease in accuracy)
##    For each tree, calculate the OOB prediction error,
##    then recalculate it after permuting each predictor.
##    The differences in prediction error are averaged over all trees.
## *NodePurity (mean decrease in node impurity)
##    A measure of the total decline in impurity attributable to each
##    predictor, averaged over all trees.
## Note: dividing by sum(abalone.imp) scales by the grand total of both
## columns combined; to express each metric as percentages within its own
## column use: 100 * sweep(abalone.imp, 2, colSums(abalone.imp), '/')
100*abalone.imp/sum(abalone.imp)
## %IncMSE IncNodePurity
## SEX 0.12480334 3.007474
## LENGTH 0.06201705 8.715762
## DIAMETER 0.06745577 10.497484
## HEIGHT 0.08890620 13.050985
## WHOLE_WEIGHT 0.06848899 14.899248
## MEAT_WEIGHT 0.13732241 13.552916
## GUT_WEIGHT 0.07246091 12.110066
## SHELL_WEIGHT 0.10750320 23.437107
varImpPlot(abalone.rf)
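The permutation logic behind %IncMSE can be illustrated manually for a single predictor. Note that this sketch permutes within the training data, whereas randomForest uses the out-of-bag samples, so the numbers will differ from the table above:

```r
## Sketch of the permutation idea behind %IncMSE for a single predictor
set.seed(123)
mse <- function(fit, dat) mean((predict(fit, newdata = dat) - dat$RINGS)^2)
base.mse <- mse(abalone.rf, abalone)
abalone.perm <- abalone
abalone.perm$SHELL_WEIGHT <- sample(abalone.perm$SHELL_WEIGHT)
perm.mse <- mse(abalone.rf, abalone.perm)
perm.mse - base.mse  # increase in MSE when SHELL_WEIGHT is scrambled
```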
## use brute force
abalone.rf %>% pdp::partial('SHELL_WEIGHT') %>% autoplot
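As an extension, individual conditional expectation (ICE) curves can reveal heterogeneity that the averaged partial dependence hides. This is a sketch; drawing ICE curves for all 4177 rows is slow, so we pass a random subsample via the train argument:

```r
## ICE curves on a random subsample (one curve per observation)
set.seed(123)
abalone.sub <- abalone[sample(nrow(abalone), 200), ]
abalone.rf %>%
    pdp::partial(pred.var = 'SHELL_WEIGHT', ice = TRUE,
                 train = abalone.sub) %>%
    autoplot(alpha = 0.2)
```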